Parallelizing K-means with Hadoop/Mahout for Big Data Analytics
Abstract
The rapid development of Internet and cloud computing technologies has led to the explosive generation and processing of huge amounts of data. The ever-increasing data volumes bring great value to society, but at the same time pose a number of challenges. Data mining techniques have been widely used for decision analysis in finance, medicine, management, business and many other fields. However, how to analyse and mine valuable information from massive data has become a crucial problem, as traditional methods can hardly achieve high scalability in data processing. Recently, MapReduce has emerged as a major programming model for big data analytics. Apache Hadoop, an open-source implementation of MapReduce, has been widely taken up by the community. Hadoop facilitates the utilization of a large number of inexpensive commodity computers. In addition, Hadoop provides support for handling faults, which is especially useful for long-running jobs. Mahout is a new open-source Apache project that provides a number of machine learning and data mining algorithms on the Hadoop platform. As a machine learning technique, K-means has been widely used for clustering in data analytics. However, K-means incurs high computational overhead when the data set to be analysed is large. This thesis parallelizes K-means using the MapReduce model and implements a parallel K-means with Mahout on the Hadoop platform. The parallel K-means significantly reduces computation time compared with the standard K-means on a large data set. In addition, this thesis evaluates the impact of Hadoop parameters on the performance of the Hadoop framework.
Except where reference has been made to the work of others, this thesis is the result of my own work. No part of this thesis has been submitted elsewhere for any other degree or qualification.
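The MapReduce formulation of K-means mentioned in the abstract can be illustrated with a minimal sketch. It is not the thesis's implementation and does not use Mahout's or Hadoop's API: it simulates the map, shuffle and reduce phases in a single Python process, and all function names (`kmeans_map`, `kmeans_reduce`, `kmeans_iteration`, `parallel_kmeans_sketch`) are hypothetical. The assumed scheme is the common one in which each map call assigns a point to its nearest centroid and each reduce call averages the points assigned to one centroid.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, Python 3.8+


def kmeans_map(point, centroids):
    """Map step: emit (index of nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: dist(point, centroids[i]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points that share one centroid index."""
    total, count = None, 0
    for point, n in values:
        if total is None:
            total = list(point)
        else:
            total = [a + b for a, b in zip(total, point)]
        count += n
    return tuple(s / count for s in total)


def kmeans_iteration(points, centroids):
    """One simulated MapReduce job: map, group by key (shuffle), reduce."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)
    # Assumption: a centroid that received no points is left unchanged.
    return [
        kmeans_reduce(groups[i]) if i in groups else centroids[i]
        for i in range(len(centroids))
    ]


def parallel_kmeans_sketch(points, centroids, max_iter=20, tol=1e-9):
    """Repeat the job until the centroids move less than tol (hypothetical driver)."""
    for _ in range(max_iter):
        updated = kmeans_iteration(points, centroids)
        if all(dist(a, b) <= tol for a, b in zip(updated, centroids)):
            return updated
        centroids = updated
    return centroids
```

For example, `parallel_kmeans_sketch([(0, 0), (0, 2), (10, 10), (10, 12)], [(0, 0), (10, 10)])` groups the first two and the last two points. On a real cluster the map calls would run in parallel over data splits and each iteration would be a separate job, which is where the speed-up on large data sets would come from.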
Similar resources
A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
Spatio-Temporal Big Data Analytics for Environmental Health
The framework for our proposed big data analytics platform is shown in Figure 1. Two complementary systems support the wide variety of spatial analytics algorithms and techniques we are providing. On the left half of Figure 1, the more-traditional Unix filesystem supports high-throughput computation (e.g., MPI [Snir et al., 1995], OpenMP [Dagum and Menon, 1998], GPGPU/CUDA [Luebke et al., 2006])...
Feedback - Study and Improvement of the Random Forest of the Mahout library in the context of marketing data of Orange
In the realm of Big Data systems, Hadoop has emerged as one of the most popular systems and a very diverse ecosystem has grown around it, meeting all kinds of functional and technical needs. One niche that should have been a place of choice in this ecosystem is data analytics: first because getting value out of large datasets requires efficient Machine Learning (ML) algorithms, second because l...
Hadoop Based Big Data Clustering using Genetic & K-Means Algorithm
This is the era of huge data sets, or Big Data. Clustering of Big Data plays several important roles in Big Data analytics. In this paper, we introduce a Big Data clustering algorithm that combines the Genetic and K-Means algorithms using the Hadoop framework. The major aim of this hybrid algorithm is to make the clustering process faster and also raise the accuracy of the resultant clus...
A BigBench Implementation in the Hadoop Ecosystem
BigBench is the first proposal for an end-to-end big data analytics benchmark. It features a rich query set with complex, realistic queries. BigBench was developed based on the decision support benchmark TPC-DS. The first proof-of-concept implementation was built for the Teradata Aster parallel database system, and the queries were formulated in the proprietary SQL-MR query language. To test oth...
Publication date: 2015